feat: add complete automation infrastructure for plot generation (#4)
- Create detailed 13-phase plan for first POC version
- Include manual workflow orchestration via Claude Code
- Cover matplotlib and seaborn implementations
- Define self-review loop, multi-version testing, and preview optimization
- Specify GCS upload, PostgreSQL metadata, API, and minimal frontend
- Provide code examples, scripts, and troubleshooting guide
- Estimate 4-5 hours total implementation time

This plan validates the end-to-end workflow before automating with GitHub Actions.
Core Components:
- Spec template (.template.md) for consistent plot specifications
- Example spec: scatter-basic-001.md
- Plot generator with Claude + versioned rules + self-review loop
- Three GitHub Actions workflows for full automation

Workflows:
1. spec-to-code.yml: Auto-generates code when issue gets 'approved' label
   - Extracts spec from issue
   - Generates matplotlib + seaborn implementations
   - Self-review loop (max 3 attempts)
   - Creates PR automatically
2. test-and-preview.yml: Tests code and generates preview images
   - Multi-version testing (Python 3.10-3.13)
   - Generates preview PNGs
   - Uploads to GCS
   - Comments on PR with preview links
3. quality-check.yml: AI quality evaluation with Claude Vision
   - Downloads previews from GCS
   - Evaluates against spec quality criteria
   - Scores each implementation (0-100)
   - Comments with detailed feedback
   - Adds labels (quality-approved or quality-check-failed)

This infrastructure enables:
- Complete automation from GitHub Issue → Code → Test → Preview → Quality
- Built-in self-review and quality gates
- No manual steps required for plot generation
- Production-ready code output

Next: Test with scatter-basic-001 issue
Add version tracking to spec files for maintainability and automated upgrades.

Changes:
- Add version marker to spec template (v1.0.0)
- Update scatter-basic-001.md with version info
- Create upgrade_specs.py script for automated spec upgrades
- Add specs/VERSIONING.md documentation
- Update plot_generator.py to check and display spec version

Benefits:
- Easy template evolution without breaking existing specs
- Automated migration when the template improves
- Clear tracking of which template version each spec uses
- Version upgrades can be batch-processed

Usage:
# Check what would change
python automation/scripts/upgrade_specs.py --dry-run
# Upgrade all specs to latest version
python automation/scripts/upgrade_specs.py
# Upgrade specific spec
python automation/scripts/upgrade_specs.py --spec scatter-basic-001

This enables continuous improvement of the spec template while maintaining backward compatibility.
Add Claude-powered upgrade system for intelligent spec improvements beyond
structural changes.
New Features:
- upgrade_specs_ai.py: AI-powered spec upgrader using Claude
- Semantic improvements (better wording, clearer criteria)
- Preserves spec ID and core intent
- Automatic backup creation (.backup-{version} files)
- Dry-run mode for preview
- Single spec or batch upgrade
- specs/upgrades/: Directory for version-specific upgrade instructions
- Custom instructions per version transition
- Examples and before/after comparisons
- Rationale for changes
- Updated VERSIONING.md with AI upgrade documentation
Why AI-Powered Upgrades?
- Structural changes (new sections) are easy with regex
- Semantic improvements (better wording, specificity) need understanding
- AI can reformulate quality criteria to be measurable
- AI can enhance parameter descriptions with types and ranges
- AI preserves intent while improving clarity
Usage:
export ANTHROPIC_API_KEY=sk-ant-...
# Preview changes
python automation/scripts/upgrade_specs_ai.py --dry-run
# Upgrade all specs (with backups)
python automation/scripts/upgrade_specs_ai.py --version 1.0.0
# Upgrade single spec
python automation/scripts/upgrade_specs_ai.py --spec scatter-basic-001
Combined Strategy:
1. upgrade_specs.py - For structural changes (fast, no API calls)
2. upgrade_specs_ai.py - For semantic improvements (AI-powered)
This enables continuous spec quality improvement as we learn better
documentation patterns.
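The backup and dry-run behaviour described above can be sketched as follows. `upgrade_spec` is a hypothetical helper: the real `upgrade_specs_ai.py` obtains `new_text` from Claude, which is omitted here.

```python
import difflib
from pathlib import Path


def upgrade_spec(path: Path, new_text: str, version: str, dry_run: bool = False) -> str:
    """Write an upgraded spec, keeping a .backup-{version} copy of the original.

    In dry-run mode, return a unified diff instead of writing anything.
    (Sketch only: the real script generates new_text via Claude.)
    """
    old_text = path.read_text()
    if dry_run:
        return "".join(difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=str(path), tofile=f"{path} (upgraded)"))
    # Keep the original alongside the upgraded file, e.g. spec.md.backup-1.0.0
    path.with_suffix(f"{path.suffix}.backup-{version}").write_text(old_text)
    path.write_text(new_text)
    return new_text
```

Keeping the backup next to the spec makes a bad AI rewrite trivially reversible, which matters more here than with the purely structural upgrader.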
Replace direct Anthropic API calls with Claude Code Action for AI tasks.
This allows the workflows to use the existing CLAUDE_CODE_OAUTH_TOKEN
instead of requiring a separate ANTHROPIC_API_KEY.
Changes:
spec-to-code.yml:
- ✅ Trigger remains label-based ('approved' label on issue)
- ✅ Spec extraction remains unchanged
- ✅ Code generation now uses Claude Code Action
- ✅ Provides detailed prompt with rules and requirements
- ✅ Claude Code reads specs, rules, and generates implementations
- ✅ Self-review loop integrated into Claude Code execution
- ✅ Automatic commit and PR creation
quality-check.yml:
- ✅ Trigger remains workflow_run based
- ✅ Preview image download from GCS unchanged
- ✅ Quality evaluation now uses Claude Code Action
- ✅ Claude Code views images (Vision) and evaluates against specs
- ✅ Uses gh CLI to comment on PR and add labels
- ✅ Scores each implementation (0-100, ≥85 to pass)
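The pass/fail gate applied after scoring can be sketched as a small helper. `quality_label` is hypothetical, but the two labels and the ≥85 threshold come straight from the workflow description above:

```python
QUALITY_THRESHOLD = 85  # per the workflow: scores of 85 or higher pass


def quality_label(scores: dict[str, int]) -> str:
    """Map per-implementation scores (0-100) to the label the workflow adds.

    Hypothetical helper mirroring quality-check.yml: every implementation
    must reach the threshold for the PR to be approved.
    """
    if all(score >= QUALITY_THRESHOLD for score in scores.values()):
        return "quality-approved"
    return "quality-check-failed"
```

Requiring every implementation (not just the average) to clear the bar keeps a weak seaborn variant from hiding behind a strong matplotlib one.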
Benefits:
- No separate API key needed
- Uses existing Claude Code Max subscription
- Integrated with Claude Code ecosystem
- Cleaner, more maintainable code
- Claude Code handles commits automatically
- Better error handling and retries
Hybrid Approach:
- Label triggers → Automation starts
- Claude Code → AI-powered tasks (generation, evaluation)
- Regular Actions → Infrastructure (tests, GCS upload, PR creation)
This enables full automation using only CLAUDE_CODE_OAUTH_TOKEN
and GCS_CREDENTIALS secrets.
Make automation workflows visible and trackable in Claude Code Web by using @claude triggers instead of direct API calls. This lets you see workflow progress live in your Claude Code Web session.

Changes:
1. spec-to-code.yml (Code Generation):
   - Simplified to just trigger @claude with instructions
   - GitHub Action posts an @claude comment when the 'approved' label is added
   - Claude Code (via claude.yml) picks up the task
   - User sees generation progress in Claude Code Web
   - Claude Code creates implementations, commits, and opens the PR
2. quality-check.yml (Quality Evaluation):
   - Posts an @claude comment with evaluation instructions
   - Claude Code views preview images and evaluates
   - User sees evaluation progress in Claude Code Web
   - Claude Code posts results and adds labels
3. claude.yml (Core):
   - Enhanced permissions: contents:write, pull-requests:write, issues:write
   - Allows Claude Code to commit and create PRs
   - Fetches full history for better context

Workflow Flow:
1. Issue gets 'approved' label
2. spec-to-code.yml posts @claude comment
3. claude.yml triggers (sees @claude)
4. Claude Code executes in Web (YOU SEE THIS LIVE!)
5. Claude Code commits implementations
6. Claude Code creates PR
7. test-and-preview.yml runs (regular CI)
8. quality-check.yml posts @claude comment
9. Claude Code evaluates (YOU SEE THIS LIVE!)
10. Claude Code posts quality report + labels

Benefits:
- 👁️ Full visibility in Claude Code Web
- 🎮 Can intervene if needed
- 📊 See progress in real time
- 🔍 Better debugging when things go wrong
- 💬 Natural conversation-based workflow
- 🤖 Same AI throughout (Claude Code Max)
Pull Request Review: Complete Automation Infrastructure

🎯 Overview

This PR adds comprehensive automation infrastructure for AI-powered plot generation. The approach is well-architected and aligns with the project's specification-first philosophy. Here's my detailed review:

✅ Strengths

Architecture & Design

Code Quality
🔴 Critical Issues

1. Security: API Key Exposure Risk (plot_generator.py:169-173)

Location:

```python
api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    raise ValueError("ANTHROPIC_API_KEY environment variable not set")
```

Issue: While the code correctly uses environment variables, the model ID is hardcoded: `model="claude-sonnet-4-20250514"`.

Concerns: This prevents easy model switching, and there is no rate limiting or error handling around API calls.

Recommendation:

```python
import time

from anthropic import APIConnectionError, APIError, RateLimitError

model = os.getenv("CLAUDE_MODEL", "claude-sonnet-4-20250514")

# Add retry logic with exponential backoff
for attempt in range(3):
    try:
        response = client.messages.create(...)
        break
    except (RateLimitError, APIConnectionError):
        time.sleep(2 ** (attempt + 1))  # back off: 2s, 4s, 8s
    except APIError:
        raise  # other API errors are not retryable
```

2. Workflow Artifact Passing Issue (quality-check.yml:25-30)

Location:

```yaml
- name: Download preview metadata
  uses: actions/download-artifact@v4
  with:
    name: preview-metadata
    run-id: ${{ github.event.workflow_run.id }}
    github-token: ${{ secrets.GITHUB_TOKEN }}
```

Recommendation: Use a persistent storage mechanism (e.g., commit a file to the repo, or use GitHub cache) or restructure to use a single workflow with multiple jobs.

3. Code Injection Vulnerability (test-and-preview.yml:111)

Location:

```bash
MPLBACKEND=Agg uv run python "$file" 2>&1 | tee "preview_outputs/${SPEC_ID}_${LIBRARY}_${VARIANT}.log"
```

Issue: Directly executing Python files from user-submitted PRs without sandboxing or validation is a major security risk.
Critical Recommendation: Run the code in a sandboxed environment (a Docker container with no network access) with CPU, memory, and time limits, and only execute it from trusted branches after review:

```yaml
- name: Generate preview images
  run: |
    docker run --rm --network none --cpus="1.0" --memory="512m" \
      -v $(pwd):/workspace -w /workspace \
      python:3.12-slim bash -c 'MPLBACKEND=Agg python "$file"'
```
Addresses critical review feedback with 5 key improvements:
1. GCS Optional: Make GCS upload optional with proper error handling
- Add continue-on-error and conditional checks
- Workflows work without GCS credentials
2. Spec ID Extraction: Case-insensitive, flexible digit count
- Support 3-4 digits (001-9999)
- Convert to lowercase automatically
- Validate spec file existence
3. Code Syntax Validation: AST-based validation before saving
- New extract_and_validate_code() function
- Catches syntax errors early with retry
- Clear error messages
4. Path Parsing: Regex-based validation with error handling
- Replace fragile cut-based parsing
- Validate format: plots/{lib}/{type}/{spec}/{variant}.py
- Skip invalid files with warning
5. API Error Handling: Retry with exponential backoff
- Retry on RateLimitError and APIConnectionError
- 3 retries with 2s, 4s, 8s delays
- Don't retry on other API errors
Ready for initial testing without GCS setup.
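The AST-based validation in item 3 can be sketched as below. This is an illustrative version of `extract_and_validate_code()`; the real function's fence handling and retry wiring may differ.

```python
import ast
import re


def extract_and_validate_code(response: str) -> str:
    """Pull the first fenced Python block from a model response and
    syntax-check it with ast.parse before it is saved to disk.

    Illustrative sketch of the extract_and_validate_code() added here.
    """
    # Take the first ```python (or bare ```) fenced block, else the whole response.
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    code = match.group(1) if match else response
    try:
        ast.parse(code)
    except SyntaxError as exc:
        raise ValueError(f"Generated code has syntax errors: {exc}") from exc
    return code
```

Raising `ValueError` (rather than writing the file) is what lets the generator's retry loop catch bad output early.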
🔍 Code Review

Thank you for this comprehensive PR adding automation infrastructure for plot generation! This is a well-structured implementation with good separation of concerns. Below is my detailed review:

✅ Strengths
🐛 Potential Bugs & Issues

High Priority
Medium Priority
Low Priority
🚀 Performance Considerations
🔒 Security Concerns
🧪 Test Coverage

Missing Tests
Test Structure Recommendations:

````python
# tests/unit/automation/generators/test_plot_generator.py
import pytest


def test_extract_and_validate_code_success():
    response = '''Here's the code:
```python
def create_plot():
    pass
```'''
    code = extract_and_validate_code(response)
    assert "def create_plot" in code


def test_extract_and_validate_code_syntax_error():
    with pytest.raises(ValueError, match="syntax errors"):
        extract_and_validate_code("def invalid syntax")
````

📝 Code Quality Suggestions
🎯 Recommendations Summary

Must Fix Before Merge

Should Fix Soon

Nice to Have
📊 Overall Assessment

Code Quality: ⭐⭐⭐⭐☆ (4/5) - Well structured with good patterns, minor issues
Security: ⭐⭐⭐⭐⭐ (5/5) - Proper secret management, appropriate isolation
Test Coverage: ⭐⭐☆☆☆ (2/5) - Missing unit tests for critical components
Documentation: ⭐⭐⭐⭐☆ (4/5) - Good inline docs, comprehensive CLAUDE.md
Performance: ⭐⭐⭐⭐☆ (4/5) - Reasonable for current scale, some optimization opportunities

✅ Conclusion

This is a solid foundation for the automation infrastructure! The architecture is well-designed and follows GitHub Actions best practices. The main concern is the lack of unit tests for critical components.

Recommendation: After addressing the "Must Fix" items (especially the missing checkout step and adding basic tests), this PR will be ready to merge. Great work on the comprehensive automation setup! 🎉
…an-01QbVSDkhbSCB2w6qKJe59m5 feat: add complete automation infrastructure for plot generation
#1089) Fixes #967

## Summary
- Update quality threshold documentation from 85 to 90
- Add CLI retry logic for intermittent failures
- Change issue lifecycle: spec-ready issues stay open until all implementations complete

## Changes

### Documentation (Issue Finding #4)
Updates quality threshold from 85 to 90 in documentation to match actual workflow configuration in `impl-review.yml`.

**Files:**
- CLAUDE.md
- README.md
- docs/workflow.md
- docs/concepts/claude-skill-plot-generation.md
- prompts/quality-evaluator.md

### CLI Retry Logic (Issue Finding #3)
Adds a retry mechanism for Claude CLI steps to handle intermittent "Executable not found in $PATH" errors. Each Claude step now has `continue-on-error: true` and a retry step that runs if the first attempt fails.

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/spec-update.yml
- .github/workflows/impl-generate.yml
- .github/workflows/impl-repair.yml
- .github/workflows/util-claude.yml

### Issue Lifecycle (Bonus)
- `spec-ready` issues now stay open until all 9 library implementations are merged
- Changed `Closes #...` to `Related to #...` in spec PR body
- Added auto-close logic in `impl-merge.yml` when all `impl:{library}:done` labels are present

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/impl-merge.yml

## Test Plan
- [ ] Verify documentation shows correct 90 threshold
- [ ] Trigger a workflow and verify retry works on CLI failure
- [ ] Create a test spec and verify issue stays open after merge
- [ ] Verify issue closes when all 9 libraries have `impl:{library}:done`
- Fix #1: Handle missing remote branches in impl-generate.yml
  - Check if branch exists before checkout to avoid 'not a commit' errors
  - Fall back to creating fresh branch from main if remote doesn't exist
- Fix #2: Clean up duplicate labels on failure
  - Remove both 'generate:X' and 'impl:X:pending' when marking as failed
  - Prevents label accumulation (e.g., both pending and failed)
- Fix #3: Auto-close issues when done + failed = 9
  - Previously only closed when all 9 were 'done'
  - Now closes when total (done + failed) reaches 9
  - Shows which libraries failed in closing comment
- Fix #4: Track generation failures and auto-mark as failed
  - Count previous failed runs for same spec/library
  - After 3 failures, mark as 'impl:X:failed' automatically
  - Posts failure comment explaining the library may not support this plot type
…1308)

## Summary
Fixes several workflow issues discovered during batch processing of spec-ready issues.

### Fix #1: Branch-Not-Found Errors
**Problem:** `fatal: 'origin/implementation/{spec}/{library}' is not a commit` errors when the workflow tries to check out a non-existent remote branch.
**Solution:** Check whether the remote branch exists before checkout; fall back to creating a fresh branch from main.

### Fix #2: Duplicate Labels
**Problem:** Issues accumulate both `impl:X:pending` and `impl:X:failed` labels when generation fails.
**Solution:** The failure handler removes both `generate:X` and `impl:X:pending` when marking as failed.

### Fix #3: Auto-Close with Failures
**Problem:** Issues with 8 done + 1 failed stay OPEN because auto-close only triggers on 9 done.
**Solution:** Close when `done + failed = 9`, with an appropriate summary (shows which libraries failed).

### Fix #4: Generation Failure Tracking
**Problem:** When `impl-generate` fails (no plot.png), no PR is created → no review → no repair → the library stays `pending` forever.
**Solution:** Track generation failures and mark as `impl:X:failed` after 3 consecutive failures. Posts a comment explaining the library may not support this plot type.

## Files Changed
- `.github/workflows/impl-generate.yml` - Fixes #1, #2, #4
- `.github/workflows/impl-merge.yml` - Fix #3

## Testing
- YAML syntax validated
- Logic reviewed against observed failure patterns
Critical fixes:
- Fix TypeError: flatten tag dict to list for repository call (#6)
  Changed search_by_tags to receive list[str] instead of dict[str, list[str]];
  the repository expects a flat list of tag values, not a nested dict
- Add missing dataprep and styling parameters (#1, #7)
  Added to the search_specs_by_tags function signature and docstring;
  these categories were documented but not implemented
- Add filter logic for dataprep and styling (#2)
  Implemented filtering checks similar to other impl-level tags,
  ensuring the new parameters actually filter results
- Update condition to include dataprep and styling (#3)
  Modified the impl-level filtering condition on line 117;
  it now checks all 6 impl-level tag categories

Improvements:
- Add database error handling with helpful messages (#8)
  Check is_db_configured() before all database operations;
  provides a clear error message if DATABASE_URL is not set
- Update test mocks to match the fixed interface (#5)
  Tests now verify the flattened tag list instead of a dict;
  added a new test for the dataprep/styling filter parameters;
  mock is_db_configured to return True in the test fixture

Verification:
- All 16 unit tests passing
- Ruff linting and formatting applied
- No routing conflicts (#4 verified - no /mcp routes in routers)

Related: PR #4132, Issue #4129
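The tag-flattening fix for #6 amounts to collapsing the category→values mapping into the flat value list the repository expects. A minimal sketch, with `flatten_tags` as a hypothetical helper and assuming category names themselves are not part of the search:

```python
def flatten_tags(tags: dict[str, list[str]]) -> list[str]:
    """Flatten {category: [values]} into the flat list[str] that the
    repository's search_by_tags expects (the fix for the TypeError in #6).

    Hypothetical helper: only the tag values are passed through;
    the category keys are dropped.
    """
    return [value for values in tags.values() for value in values]
```

With this shape, adding the new `dataprep` and `styling` categories is just two more keys in the input dict; no repository change is needed.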
Summary
Adds complete automation infrastructure for AI-powered plot generation:
Workflows
Testing Plan
Phase 1 (No GCS required):
Phase 2 (After GCS setup):
Related issue: Initial automation infrastructure setup